1. INTRODUCTION

Face recognition technology has increasingly taken over the human role in recognizing faces.
Face-recognizing equipment receives face images or videos containing human faces as input, from which
biometric facial data is subsequently extracted and processed to conduct the recognition task [1]. Because the
facial features of an individual are unique [2], they have been acknowledged as an effective means for security
purposes, e.g., replacing passwords and identity cards, and granting authorized access. Popular face
recognition models developed by universities and companies include VGGFace [3], DeepFace [4], [5],
OpenFace [6], and FaceNet [7]. In [8], a face recognition model based on the histogram of oriented
gradients (HOG) and a support vector machine (SVM) classifier was investigated. Besides, in [9], a method
based on the AdaBoost algorithm was used to train cascade classifiers with feature types such as HOG and
Haar-like features. Although better performance was achieved, the method is computationally demanding as
it includes a number of weak classifiers.

On the other hand, training directly on faces can be challenging owing to face occlusion, which is
common in practice. To overcome this issue, Zhang et al. [10] proposed an algorithm based on the
Bayesian framework that locates the head using the omega-like shape formed by the head-shoulder part of
a person. This technique has been applied widely in automatic teller machines (ATM). Additionally, in [11],
the face recognition task was carried out with deformable part models (DPM), yielding remarkable results,
though at the cost of heavy computational resources. A DPM-based system was deployed as well in [12],
which offers a reduction in error rate and false-negative face detection. Nonetheless, this technique is
limited to front-view facial images and is thus not universal.

Recent years have witnessed the rise of convolutional neural network (CNN) applications in face
detection. Deep CNN (DCNN) [13], region-based convolutional neural networks (R-CNN) [14], and other
one- or two-stage deep CNN-based systems such as VGGNet [15] and ResNet [16] have showcased
outstanding performance in comparison with their conventional counterparts. However, as more
convolutional layers are added to a CNN, the detection speed is reduced considerably. To overcome this
issue, a number of multi-stage face-detecting algorithms have been investigated, for example, the
funnel-structured cascade (FuSt) [17] and the pyramid-based cascade model that distills knowledge online
and mines hard samples offline [18], which deliver an outstanding true positive rate and real-time performance.

CNNs are data-driven, as they learn feature extraction and face classification from training data.
Additionally, CNNs trained with 2D facial data could further be tuned with 3D data for potentially
better recognition accuracy. Tornincasa [19] showcased how the pertinent discriminating features of the
query faces can be extracted by the use of differential geometry. Dagnes et al. [20] investigated an
algorithm that computes the optimized marker layout to capture face movement. To deal with
different facial expressions and illumination, Radon and wavelet transforms were combined in [21] for
nonlinear feature extraction. Notably, the so-called DeepID model, which is constructed of a large number of
CNNs, and its extensions were proposed in [22]-[24] with a better feature extracting capability. This is
possible because they can process a variety of face positions and facial patches.

In this paper, we design a face recognition system based on the FaceNet model with an SVM
classifier. We then compare the accuracy of the proposed method with that of two other face recognition
methods on three public datasets to increase the generality of the study. Finally, the paper showcases
how to integrate the system into a web-based timekeeping application. The obtained results regarding the
system performance and its implementation are applicable to both research and practical usage.


2. MATERIALS AND METHODS 
2.1. Multi-task cascaded convolutional neural networks (MTCNN) 

The MTCNN framework detects and aligns faces with unified cascaded CNNs. MTCNN is tasked with
three outputs. Firstly, it has to classify whether an input is a face or non-face. Then it has to perform the
bounding box regression, and finally it localizes the facial landmarks. Each layer takes the output of its
preceding layer as input and, in the end, the overall learning target is obtained by summing over the tasks.
Corresponding to these tasks, the MTCNN is constructed of three layers, in order the proposal network
(P-Net), the refine network (R-Net), and the output network (O-Net). The architectures of MTCNN are
shown in Figure 1.

Layer 1 (P-Net) is a fully convolutional network (FCN), which is used to generate the candidate
windows and their corresponding bounding box regression vectors. P-Net merges highly overlapping
candidates to reduce the candidate volume. Layer 2 (R-Net) is a CNN that differs from an FCN in that its
last stage is a dense (fully connected) layer. R-Net takes the output of P-Net, screens out the false
candidates, calibrates with bounding box regression, and merges overlapping candidates using non-maximum
suppression (NMS). Layer 3 (O-Net) functions in a similar manner to the R-Net. However, it describes the
faces in more detail and delivers the positions of five facial landmarks: the left/right eyes, the nose, and
the left/right mouth corners.
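
The NMS step used to merge overlapping candidates can be sketched as follows. This is a minimal greedy variant written for illustration only; the box format (x1, y1, x2, y2), the function names, the toy candidates, and the IoU threshold of 0.5 are our assumptions and are not taken from the MTCNN implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit candidates from the highest to the lowest face score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical candidate windows: the first two cover the same face.
candidates = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
face_scores = [0.90, 0.80, 0.95]
kept_indices = nms(candidates, face_scores)
```

In this toy case the second window is suppressed because it overlaps heavily with the higher-scoring first window, while the distant third window is kept.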


2.2. FaceNet model 

FaceNet takes an image of a person's face as input and maps it into a 128-dimensional Euclidean
space. The distances between face images of the same person are comparatively smaller than those to images
of other, random people. In general, there are two different CNN-based basic architectures in FaceNet. The
first category adds 1x1xd convolutional layers between the standard convolutional layers of the
Chen et al. [25] architecture, resulting in a 22-layer model (NN1). The second category consists of Inception
models based on GoogLeNet [26]. The inception module contains 4 branches from left to right. It employs
convolution with 1x1 filters as well as 3x3 and 5x5 filters and a 3x3 max-pooling layer. Each branch uses a
1x1 convolution to reduce the time complexity. The FaceNet model is a DCNN trained via a triplet loss
technique that makes vectors of the same identity more similar, while vectors of different identities become
less similar.
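
The effect of the 1x1 convolution on the time complexity can be illustrated with a rough count of multiply-accumulate operations. The sketch below is only a back-of-the-envelope calculation; the feature map size (28x28), the channel numbers (192, 16, 32), and the helper name `conv_macs` are assumptions chosen for illustration and do not describe the exact FaceNet configuration.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulate count of a stride-1, same-padded convolution."""
    return height * width * c_in * c_out * kernel * kernel


# Assumed toy setting: a 28x28 feature map with 192 input channels.
# Direct 5x5 convolution producing 32 output channels.
direct = conv_macs(28, 28, 192, 32, 5)

# Same output, but a 1x1 convolution first reduces 192 channels to 16.
reduced = conv_macs(28, 28, 192, 16, 1) + conv_macs(28, 28, 16, 32, 5)

# Factor by which the 1x1 reduction lowers the operation count.
saving_ratio = direct / reduced
```

Under these assumed sizes, the branch with the 1x1 reduction needs roughly one tenth of the operations of the direct 5x5 convolution.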


2.3. Triplet loss 

The face recognition model is trained with batches of data, each consisting of three images: the
anchor, the positive, and the negative. Specifically, Figure 2 illustrates how the triplet loss operates by
maximizing the anchor-negative distance and minimizing the anchor-positive one. Notably, an image is
considered positive if it has the same identity as the anchor, and negative otherwise. Thanks to this
mechanism, triplet loss has been considered one of the most effective ways of learning 128-D face image
encodings.
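
For a single triplet, the loss can be sketched as below. The hinge form with a margin follows the usual formulation of the triplet loss; the margin value of 0.2, the 2-D toy vectors (instead of 128-D encodings), and the function names are assumptions made for illustration.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet."""
    d_pos = squared_l2(anchor, positive)  # should become small
    d_neg = squared_l2(anchor, negative)  # should become large
    return max(d_pos - d_neg + margin, 0.0)


# Toy 2-D embeddings used only to show the behaviour of the loss.
anchor = [0.0, 1.0]
positive = [0.1, 0.9]
negative = [1.0, 0.0]

# Well-separated triplet: the constraint is already satisfied, loss is zero.
easy = triplet_loss(anchor, positive, negative)
# Swapping the roles violates the constraint and yields a positive loss.
hard = triplet_loss(anchor, negative, positive)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute to training.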
Figure 1. The architectures of (a) P-Net, (b) R-Net, and (c) O-Net, where “MP” means max pooling and
“Conv” means convolution. The step size in convolution and pooling is 1 and 2, respectively

Figure 2. The triplet loss training


2.4. Proposed approach 




The pipeline of our face recognition system is illustrated in Figure 3. It can be further elaborated as follows:
Firstly, the MTCNN is trained with face images of all the staff in an organization.
Secondly, images and video frames are input into our system and the MTCNN face detector is applied
to locate the faces. These faces are pre-processed and aligned based on the facial landmarks
computed by MTCNN. The facial landmarks comprise five points: the nose, the left/right eyes, and the
left/right mouth corners. Moreover, the MTCNN is also used to construct image pyramids corresponding
to the face images.
Thirdly, the FaceNet algorithm is applied to extract the 128-D embeddings from the face images.
Fourthly, we deploy a search algorithm to find in the database an encoding whose distance to the
real-time face image encoding satisfies a threshold value. If such an encoding is found, the person is
recognized as a staff member and the information about their presence is updated in the database.
Otherwise, the application screen displays a notification announcing that the person’s face cannot be
recognized or is not in the staff list.
Figure 3. The pipeline of our face recognition system


Notably, our system works with Euclidean image embeddings: the network is trained so that squared
L2 distances in the embedding space directly correspond to face similarity. As a result, the distance
between images of the same subject is small and that between images of different subjects is large. Once
the embedding space has been created, face verification can be performed easily by thresholding the
distance between two points in the space. Subsequently, the SVM algorithm is applied for the classifying
operation.
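
The threshold-based search over stored encodings can be sketched as follows. This is a simplified linear scan and not necessarily the search algorithm used in our system; the 3-D toy embeddings (instead of 128-D), the names `staff_a` and `staff_b`, the function `identify`, and the threshold of 0.5 are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def identify(query, database, threshold):
    """Return (name, distance) of the closest stored encoding.

    The name is None when even the closest encoding is farther away
    than the threshold, i.e. the person is treated as unknown.
    """
    best_name, best_dist = None, float("inf")
    for name, embedding in database.items():
        dist = squared_l2(query, embedding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist <= threshold:
        return best_name, best_dist
    return None, best_dist


# Hypothetical database with one stored encoding per staff member.
database = {"staff_a": [0.0, 0.0, 1.0], "staff_b": [1.0, 0.0, 0.0]}

known = identify([0.1, 0.0, 0.9], database, threshold=0.5)
unknown = identify([0.0, 1.0, 0.0], database, threshold=0.5)
```

The first query lies close to the stored encoding of `staff_a` and is accepted, whereas the second query is far from every stored encoding and is reported as unknown.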
3. IMPLEMENTATION AND RESULTS
3.1. Datasets used for training and testing 

In this paper, three public datasets, namely Labeled Faces in the Wild (LFW) [27], Our Database of
Faces (ORL) [28], and the Yale Face Database [29], were used to assess the accuracy of the studied approach.
The numbers of face images and subjects along with their notes are given in Table 1. The three datasets vary
largely in the number of images, the number of subjects, and their configurations, which is expected to
support the generality of this study.


Table 1. Three public face image datasets used in this study

Name | Number of face images and subjects | Challenges | Note
LFW [27] | 13,233 images (5,749 subjects) | Face pose, expression, illumination. | Subjects with more than 20 images were selected, resulting in 3,137 images (62 subjects).
ORL [28] | 400 images (40 subjects) | Timing, expression (open/close eyes, yes/no smile), illumination, accessories (yes/no glasses). | All subjects were used. Dark background. Upright, frontal face images.
Yale Face Database [29] | 165 images (15 subjects) | Expression (happy, neutral, sad, sleepy, surprised, wink), illumination (center/left/right light), accessories (yes/no glasses). | All subjects were used. Grayscale GIF images.
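
The subject selection noted for LFW in Table 1 (keeping only subjects with more than 20 images) can be sketched as a simple count-based filter. The function name `select_subjects`, the toy labels, and the toy limit of 2 images are assumptions for illustration; the actual preprocessing code is not part of this paper.

```python
from collections import Counter


def select_subjects(labels, min_images):
    """Return indices of images whose subject has more than min_images images."""
    counts = Counter(labels)
    return [i for i, name in enumerate(labels) if counts[name] > min_images]


# Toy label list: subject "a" has 3 images, "b" has 2, and "c" has 1.
labels = ["a", "a", "a", "b", "b", "c"]
selected = select_subjects(labels, min_images=2)
```

With the limit set to 20 instead of 2, the same filter corresponds to the selection rule stated in Table 1.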


3.2. Experiment results 

Table 2 reports the accuracy obtained on the three datasets using FaceNet with a support vector
machine (SVM) classifier. In particular, every image was mapped into the Euclidean embedding space and
the predicted identity was compared with its label. It can be concluded that FaceNet with SVM delivers
results with a relatively high level of accuracy. Figure 4 demonstrates how the triplet loss function
minimizes the anchor-positive distances and maximizes the anchor-negative ones after training on the
subset of the LFW dataset.


Table 2. Accuracy comparison using FaceNet with SVM

Dataset | FaceNet with SVM [%]
LFW | 99.83
ORL | 97.5 [2]
Yale Face Database | 98.9 [2]


Figure 4. Triplet loss training on the subset of LFW dataset
The results of the FaceNet face recognition are subsequently compared with those of two other
methods, namely principal component analysis (PCA) with an SVM classifier [25] and the k-nearest
neighbor (K-NN) method [26]. As can be seen in Table 3, FaceNet with SVM achieves an accuracy of at least
97.5% on all three datasets, the highest among the compared methods. It should be noted that this model
performs well despite challenges such as the variety of face poses, expressions, illumination, and the use
of accessories.


Table 3. Accuracy comparison using FaceNet with SVM, PCA with SVM, and K-NN

Method | LFW [%] | ORL [%] | Yale Face Database [%]
FaceNet with SVM | 99.83 | 97.5 [2] | 98.9 [2]
PCA with SVM | 62.14 | 95.12 | 82.35
K-NN | 30.24 | 85.36 | 52.94


Face recognition can effectively detect human presence in a particular area of interest (AOI) such as
an office or an educational institution. In this paper, we established a web-based timekeeping application.
Figure 5 illustrates how the system works. The system consists of a remote server and a database that can be
accessed with a web application for monitoring and administration purposes. An IP camera is set at the
entrance to a company to stream video frames in real time to the face recognition API. If a face is
detected, an image from that time frame is preprocessed and passed on to the deep CNN to generate the
128-D embedding.


Figure 5. Web-based timekeeping application


Subsequently, the staff member's identity is determined with the SVM classifier, and the data related
to the staff member's presence, such as the identity, the accuracy percentage, and the time and date of entry,
are recorded in the database. Figure 6 illustrates what information a user can see on the web application
when a staff member is recognized by the system. Specifically, there is a frame identifying the detected face
with the recognized name and the accuracy at its bottom. The right side shows a list of recognized staff
members along with their ID numbers, full names, ID cards, and entry times. In case the system cannot
recognize a person because of missing data, for example, a new staff member or a visitor, the face image of
the person is displayed as shown in Figure 7.

The detected face is framed in red and labeled “Unknown”. It should be noted that the time and date
of the unrecognized entry are recorded to assist the administrator in preparing corresponding actions such
as adding the information of the new employee and re-training the model. All the information about the
entries shown in Figure 6 can be exported to an *.xls file as shown in Figure 8.
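
The export of entry records can be sketched as below. Note that this sketch writes comma-separated values with the Python standard library as a stand-in for the *.xls export of the application; the function name `export_entries` is hypothetical, while the column names and the sample row follow Figure 8.

```python
import csv
import io

# Column names as they appear in the exported file of Figure 8.
FIELDS = ["id", "Fullname", "ID Card", "Time"]


def export_entries(entries, stream):
    """Write entry records (a list of dicts) to a text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)


entries = [
    {"id": "1904", "Fullname": "Ly Quang Vu",
     "ID Card": "190404020", "Time": "7:51:42 24-7-2021"},
]

# An in-memory buffer is used here; a real export would write to a file.
buffer = io.StringIO()
export_entries(entries, buffer)
exported = buffer.getvalue()
```

Such a file can be opened directly in spreadsheet software, which is why CSV is used here in place of the binary *.xls format.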

Besides, the web application has an interface for adding new facial data. Users can open the IP
camera from the application to capture new face images in real time. These images can then be saved in the
database and assigned a unique user ID. Subsequently, the face is extracted from the image and labeled
with whatever the administrator finds suitable, for example, the person’s name.

The system was tested with a group of 32 staff members, showing a face recognition accuracy of 96%.
Nevertheless, the system is sensitive to changes in lighting and in the angle between the face and the IP
camera, which considerably degrade the system’s accuracy. Thus, in case the system fails to recognize a
staff member, he or she needs to inform the person in charge of timekeeping for a manual marking.

Figure 6. A staff member recognized by the system


Figure 7. An unrecognized person


id | Fullname | ID Card | Time
1904 | Ly Quang Vu | 190404020 | 7:51:42 24-7-2021
1960 | Phan Thanh Trieu | 196005048 | 7:49:12 24-7-2021
1950 | Pham Thi Doan Han | 195040551 | 7:49:05 24-7-2021
1990 | Le Van Xuong | 199056217 | 7:45:22 24-7-2021
1935 | Tran Huyen An | 193540021 | 7:42:51 24-7-2021
1941 | Nguyen Tran Son | 194105983 | 7:40:33 24-7-2021

Figure 8. Data exported to *.xls file.


4. CONCLUSION 



To conclude, in our system, the MTCNN algorithm is deployed to detect the faces, the pre-trained
FaceNet generates the embeddings, and the SVM classifier then recognizes the faces in images taken
through the system. The system is able to deliver in practice a recognition accuracy of 96%, given that the
images are collected under consistent conditions in terms of lighting and face-camera angle. The comparison
study can serve as a foundation for researchers seeking optimized face-recognizing algorithms.
Additionally, the paper presents an established web-based application with some key concepts that can
potentially be upgraded to a commercial timekeeping product. Applying such products in practice can save
companies and organizations a considerable amount of time and effort in timekeeping tasks. As more and
more powerful algorithms are introduced and implemented into face recognition systems, it is promising
that end users will benefit more from them. For future studies, the system can be further fine-tuned and
more training data with noise can be collected to further improve the capability of our proposal.